Teaching data science to undergrads: an ex-Googler and Reed professor's tales from the trenches.
May 29, 2015
All Reed freshmen are required to take Humanities 110.
Prerequisites: only intro stats and some exposure to the statistical software language R. No programming experience necessary.
18 students, mostly juniors and seniors.
| Major | Count |
|---|---|
| Mathematics | 4 |
| Biological Science: Biology & Biochem and Molecular Biology | 4 |
| Other Science: Chemistry, Environmental Studies, Physics | 4 |
| Social Science: Political Science, Sociology | 2 |
| Economics | 2 |
| Misc: Psychology, Linguistics | 2 |
- dplyr package for data wrangling/manipulation
- ggplot2 package for data visualization

Features:
- %>% command, pronounced "THEN"

Info on all domestic flights leaving Houston (IAH) in 2011:
- flights: info on 227,496 flights
- planes: info on 2,853 airplanes

What are the top 5 carriers using the oldest planes (averaged over all flights)?
The flights dataset:
| date | dep | arr | carrier | flight | dest | plane |
|---|---|---|---|---|---|---|
| 2011-01-01 | 1400 | 1500 | AA | 428 | DFW | N576AA |
| 2011-01-02 | 1401 | 1501 | AA | 428 | DFW | N557AA |
| 2011-01-03 | 1352 | 1502 | AA | 428 | DFW | N541AA |
| 2011-01-04 | 1403 | 1513 | AA | 428 | DFW | N403AA |
| 2011-01-05 | 1405 | 1507 | AA | 428 | DFW | N492AA |
The planes dataset:
| plane | year | model | mfr | no.seats |
|---|---|---|---|---|
| N576AA | 1991 | DC-9-82(MD-82) | MCDONNELL DOUGLAS | 172 |
| N557AA | 1993 | KITFOX IV | MARZ BARRY | 2 |
| N403AA | 1974 | S55A | RAVEN | 1 |
| N492AA | 1989 | DC-9-82(MD-82) | MCDONNELL DOUGLAS | 172 |
| N262AA | 1985 | DC-9-82(MD-82) | MCDONNELL DOUGLAS | 172 |
The following sequence of verbs wrangles/manipulates the data:

    left_join(flights, planes, by = 'plane') %>%
      select(carrier, plane, year) %>%
      mutate(age = 2011 - year) %>%
      group_by(carrier) %>%
      summarise(avg_age = mean(age)) %>%
      arrange(desc(avg_age)) %>%
      top_n(5)
| carrier | avg_age |
|---|---|
| MQ | 29.421 |
| AA | 24.325 |
| DL | 20.760 |
| US | 19.078 |
| UA | 14.635 |
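To try the verb chain without the full dataset, here is a minimal sketch that types in only the five example rows shown for each table. It is an illustration under assumptions, not the course's code: the `oldest`, `piped`, and `nested` names are made up here, `na.rm = TRUE` is added because plane N541AA has no match in the five-row `planes` excerpt, and `head(5)` stands in for `top_n(5)` after sorting.

```r
library(dplyr)

# The five example rows from each table, typed in by hand (not the full data).
flights <- data.frame(
  date    = as.Date("2011-01-01") + 0:4,
  carrier = "AA",
  flight  = 428,
  dest    = "DFW",
  plane   = c("N576AA", "N557AA", "N541AA", "N403AA", "N492AA"),
  stringsAsFactors = FALSE
)
planes <- data.frame(
  plane = c("N576AA", "N557AA", "N403AA", "N492AA", "N262AA"),
  year  = c(1991, 1993, 1974, 1989, 1985),
  stringsAsFactors = FALSE
)

# Read each %>% as "THEN": join, THEN select, THEN mutate, ...
oldest <- flights %>%
  left_join(planes, by = "plane") %>%       # N541AA gets year = NA (no match)
  select(carrier, plane, year) %>%
  mutate(age = 2011 - year) %>%
  group_by(carrier) %>%
  summarise(avg_age = mean(age, na.rm = TRUE)) %>%  # assumption: drop unmatched planes
  arrange(desc(avg_age)) %>%
  head(5)

# x %>% f(y) is the same call as f(x, y):
piped  <- flights %>% left_join(planes, by = "plane")
nested <- left_join(flights, planes, by = "plane")
```

On these five rows the only carrier is AA, and the four matched planes have ages 20, 18, 37, and 22, so `avg_age` comes out to 24.25. Without `na.rm = TRUE` the unmatched plane would turn the mean into `NA`, which is worth checking before trusting any averaged result from a `left_join`.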
- A statistical graphic consists of a mapping of data variables to aesthetic attributes of geometric objects that we can observe.
- ggplot2 allows us to construct graphics in a modular fashion by specifying the elements of the grammar.
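As a small illustration of that grammar, the sketch below maps three variables to aesthetics of one geometric object. It uses R's built-in `mtcars` dataset purely as a stand-in (it is not one of the course datasets), and the object name `p` is arbitrary.

```r
library(ggplot2)

# Data variables -> aesthetic attributes of a geometric object (points):
#   wt  -> x position
#   mpg -> y position
#   cyl -> color
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")
```

Calling `print(p)` renders the plot. Swapping `geom_point()` for another geom, or moving `cyl` from `color` to `shape`, changes one element of the grammar while the rest of the specification stays the same.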
Minard's map of Napoleon's Russian campaign of 1812:
| Data (Variable) | Geometric Object | Aesthetic Attribute of Geo Obj |
|---|---|---|
| longitude | points | x position |
| latitude | points | y position |
| army size | bars | width |
| army direction | bars | color |
| date | text | (x,y) position |
| temperature | lines | (x,y) position |
All 222,540 songs played on the Reed pool hall jukebox from 2003-2009 c/o Noah Pepper '09
| date_time | artist | album | track |
|---|---|---|---|
| Sun Dec 7 05:12:57 2003 | Tom Petty and the Heartbreakers | Into the Great Wide Open | |
| Sun Dec 7 05:15:56 2003 | Jefferson Airplane | Somebody To Love | |
| Sun Dec 7 05:23:04 2003 | Led Zeppelin | Led Zeppelin IV | 08 When The Levee Breaks |
quandl.com is a great source for economic data
Shiny: "A web application framework for R. Turn your analyses into interactive web applications. No HTML, CSS, or JavaScript knowledge required."
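The quoted description is the tagline of the shiny package. A minimal sketch of what such an app can look like follows; it uses R's built-in `faithful` dataset as a stand-in, and the input/output ids (`bins`, `hist`) and the object name `app` are arbitrary choices made here, not anything from the course.

```r
library(shiny)

# User interface: one slider and one plot, written entirely in R.
ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider moves.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions,
         breaks = input$bins,
         main   = "Eruption duration (minutes)",
         xlab   = "Duration")
  })
}

# Bundle UI and server into an app object (does not launch it).
app <- shinyApp(ui = ui, server = server)
```

In an interactive R session, `runApp(app)` launches the app in a browser.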
This is the only stats class many will take.